ABSTRACT

Metal Detector identifies CYS and HIS involved in 
transition metal protein binding sites, starting from 
sequence alone. A major new feature of release 2.0 
is the ability to predict which residues are jointly 
involved in the coordination of the same metal ion. 
The server is available at
http://metaldetector.dsi.unifi.it/v2.0/.

INTRODUCTION 

Metalloproteins are a large and diverse class of proteins 
which bind one or more metal ions in their native con- 
formation (1). Metal atoms play a wide range of struc- 
tural, regulatory or catalytic roles which are critical to 
protein function (2). Zinc ions, for instance, help
stabilize the structure of a large number of transcription
factors such as zinc fingers. Enzymes often employ metal
ions as cofactors in their catalytic sites (3). Metal binding 
proteins are implicated in heavy metal toxicity, in 
processes such as apoptosis (4), aging (5) and carcinogen- 
esis (6). Identifying metal binding sites in novel proteins 
can significantly contribute to their functional character- 
ization, as well as help in understanding metal-related 
malfunctions. 

High-throughput X-ray absorption spectroscopy (HT-XAS) has
recently proved capable of identifying metalloproteins with high
reliability (7,8). However, the specific ligands involved
in binding the metal ion(s) cannot be identified by this
technique. Bioinformatics tools can significantly contribute
to a detailed annotation of metal binding sites, as well
as to scaling up to proteome-wide analyses. Motif-based
approaches, relying on regular expression patterns or 
Pfam probabilistic models, have been employed (9) for 
sequence-based predictions on entire proteomes. The 
drawback of these methods is that they cannot identify 
novel sites: regular expression patterns tend to be quite 
specific but with low coverage (many false negatives), 
and Pfam models are limited to known metal-binding 
domains. In order to overcome these limitations, a
number of supervised learning techniques [e.g.
(10,11,12)] have been recently developed for predicting 
the metal bonding state of all residues in a sequence. 
The task consists of discriminating between free and 
metal-bonded residues (or disulfide bonded for cysteines). 

MetalDetector (13) predicts metal-bonding state of 
CYS and HIS residues, focusing on transition metals, 
heme and Fe/S groups as candidate heterogens. The 
system has been active since April 2008 and has served 
roughly 10000 queries so far. It was recently (8) 
employed in combination with HT-XAS in order to 
identify putative metal binding sites in a large set of 
protein targets generated within the Protein Structure 
Initiative (http://www.structuralgenomics.org).

Identification of binding site geometry is the main new
feature of release 2.0 presented in this article. The task
consists in predicting the number of ions binding the
protein together with their respective sets of ligands in the
sequence. Figure 1 shows an example of a protein kinase C
cysteine-rich domain (PDB entry 1tbn). It highlights the 3D
structure of the binding sites (top) and a graph-based
representation of the input sequence together with the desired
output (bottom). These predictions can have a significant
impact in a number of tasks, including: detailed functional 
annotation of experimentally unsolved proteins, e.g. char- 
acterization of active sites in enzymes, many of which 
employ metal ions as cofactors (3); experimental determin- 
ation of new metalloproteins, as the prediction of metal 
binding sites can guide the preparation of samples for 
in vitro studies (7). 

There exist several web servers for metal-binding sites 
prediction. DiANNA (10) predicts cysteine-bonding states 
only, while it is not able to reconstruct metal-binding site 
geometry; MetSite (14) identifies sites using sequence 
profile information in combination with approximate 
structural data coming from low-resolution (or predicted) 
models; FINDSITE-metal (15) predicts metal-binding 
sites from evolutionarily related templates detected by 
threading; Feature (16) identifies zinc-binding sites for 
proteins whose 3D structure is given. The applicability 
of these web servers is thus limited to structurally 
determined proteins, or proteins for which a reasonable
3D model can be derived. SeqCHED (17) is a recently 
developed server predicting metal binding geometry from 
protein sequence, which relies on remote homology detec- 
tion to create a structural model of the target protein, over 
which the original CHED (18) structure-based algorithm 
is applied. It thus cannot predict metal binding sites 
for proteins having novel folds. Similar limitations hold 
for the above-mentioned pattern-based or domain-based
approaches. MetalDetector2 is the first server capable of 
predicting metal binding geometry for novel folds starting 
from sequence information alone. 



MATERIALS AND METHODS 

Overview 

There are two crucial aspects concerning prediction 
of metal binding geometry. First, the number of admissible
configurations can be extremely large. For a protein
chain with n CYS and HIS (candidate ligands),
m ions and k_i ligands for the i-th ion, the number of
configurations is the multinomial coefficient
$n! / \bigl(k_1!\, k_2! \cdots k_m!\, (n - k_1 - \cdots - k_m)!\bigr)$. In practice, each ion
is coordinated by a variable number of ligands (typically 
ranging from 1 to 4, but occasionally more), and each 
protein chain binds a variable number of ions (typically 
ranging from 1 to 4). Assuming n = 12, m = 2 and k_i = 4
(as in the small example shown in Figure 1), we obtain
34 650 alternative configurations. We are not considering
the rare exceptions in which a CYS or HIS residue can 
bind multiple ions (in the December 2009 release of PDB, 
only 0.9% of HIS and 1.6% of CYS residues are found to be within 3 Å
of two different ions). This assumption allows us to 
develop an efficient polynomial-time algorithm (19) for 
geometry prediction. To reduce the output search space 
and improve accuracy, we limit the maximum number of 
ions to 4 (covering 97% of known transition metal sites in 
current PDB). The second key aspect of the task is that the 
participation of a residue to a metal binding site should 
not be predicted independently from the other residues: 
interdependencies between candidates should be taken 
into account to form a collective prediction. These 
aspects strongly suggest solutions based on structured- 
output learning (20). This recent research field aims at 
generalizing learning algorithms, traditionally developed 
for classification or regression tasks, to predict outputs 
consisting of complex structures [like the one shown in 
Figure 1c].

In MetalDetector2, identification of binding geometry is 
decomposed into two cascaded subtasks. The initial task 
consists of assigning bonding state to every CYS and HIS 
in two states (positive cases are metal-binding residues, 
negative cases are the rest, including half-cystines, i.e. cyst- 
eines forming disulfide bridges). The second task consists 
of grouping together metal-binding CYS and HIS, assign- 
ing them a conventional metal-ion identifier. This process 
is illustrated in Figure 1. Identification of the involved 
chemical element is not attempted. 
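
As a minimal illustration of the data involved in the two subtasks, the sketch below extracts the candidate ligands (CYS and HIS positions) from the sequence of Figure 1 and groups a set of metal-bonded residues by ion identifier. The function names and the dictionary-based representation are illustrative assumptions, not part of the server; the labeling used is the correct one reported for Figure 1 in the Results section, with arbitrary ion identifiers 1 and 2.

```python
def candidate_ligands(sequence):
    """Return the 1-based positions of CYS (C) and HIS (H) residues."""
    return [i for i, aa in enumerate(sequence.upper(), start=1) if aa in "CH"]


def group_sites(ion_ids):
    """Turn {position: ion identifier} into {ion identifier: sorted positions}."""
    sites = {}
    for position, ion in sorted(ion_ids.items()):
        sites.setdefault(ion, []).append(position)
    return sites


# Sequence of the protein kinase C cysteine-rich domain shown in Figure 1.
FIG1_SEQUENCE = "NKHKFRLHSYSSPTFCDHCGSLLYGLVHQGMKCSCCEMNVHRRCVRSVPSLCGVD"

candidates = candidate_ligands(FIG1_SEQUENCE)

# Hypothetical output of the second subtask: residues left out of the
# dictionary (here 8, 18, 28 and 35) are considered not metal bonded.
ion_ids = {3: 1, 33: 1, 36: 1, 52: 1, 16: 2, 19: 2, 41: 2, 44: 2}
sites = group_sites(ion_ids)

print(candidates)
print(sites)
```

The sequence contains 12 candidate ligands, eight of which are grouped into two sites of four residues each.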



[Figure 1 panels: (a) input sequence NKHKFRLHSYSSPTFCDHCGSLLYGLVHQGMKCSCCEMNVHRRCVRSVPSLCGVD; (b) the same sequence after bonding state identification; (c) the result of binding geometry identification.]

Figure 1. Metal binding prediction subtasks. (a) given sequence;
(b) candidate ligands (CYS and HIS) are assigned bonding state
(boldface for metal binding); (c) metal-binding residues are grouped
to form binding site configurations.



The server uses a combination of different machine 
learning algorithms. The overall operation flow is shown 
in Figure 2. 

Bonding state identification 

This was the only functionality of MetalDetector1 (13)
and the first stage of prediction in MetalDetector2. In 
Refs (11,13), we used a bidirectional recurrent neural 
network and Viterbi decoding with a simple probabilistic 
automaton to refine local predictions and obtain a collective
assignment. In MetalDetector1, it was important
to train the predictor including examples of non- 
metalloproteins and chains rich in disulfide bridges 
(since otherwise metal-binding CYS and half-cystines 
could be easily confused). When the input chain is not 
known to be a metalloprotein, we still rely on 
MetalDetector1 for prediction (Figure 2). On the other
hand, if the input chain is known to be a metalloprotein 
(users can select a checkbox in the web interface to 
indicate this knowledge), then half-cystines are rare
(<3%) and better accuracy can be obtained by training
on metalloproteins only. In this case, half-cystines are 
not predicted and we solve the supervised sequence 
labeling task using SVM-HMM (20), a model that can 
be essentially interpreted as a hidden Markov model 
with discriminatively learned parameters, and that collect- 
ively assigns bonding state to all CYS and HIS in the 
sequence. The SVM-HMM sequence is the subsequence 
containing CYS and HIS only and observations (emis- 
sions) for each position include vectors of multiple align- 
ment profiles among other features. Preliminary 
experiments showed that performance difference between 
MetalDetector1 and SVM-HMM is negligible under the
same experimental conditions, while the latter is much 
simpler to train and engineer. Notably, knowing that a
protein binds metal simplifies the prediction task by
reducing the space of candidate outputs, resulting in
better prediction accuracy on average.

Figure 2. Schematic diagram of methods in MetalDetector v2.0.
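
To illustrate what a collective assignment of bonding states means, the following sketch runs generic Viterbi decoding over the subsequence of candidate residues with two states (0 = free, 1 = metal bonded). It is not the trained SVM-HMM nor the neural network of MetalDetector1: the function name and all emission, transition and initial scores are made-up assumptions chosen only to show how transition scores can overturn a locally preferred label.

```python
def viterbi(emission, transition, initial):
    """Return the highest-scoring state path (additive scores).

    emission[t][s]   : score of state s at candidate position t
    transition[p][s] : score of moving from state p to state s
    initial[s]       : score of starting in state s
    """
    if not emission:
        return []
    n_states = len(initial)
    score = [initial[s] + emission[0][s] for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in emission[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: prev[p] + transition[p][s])
            pointers.append(best)
            score.append(prev[best] + transition[best][s] + obs[s])
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical local scores for three candidate residues: [free, bonded].
emission = [[0.2, 1.0], [0.6, 0.5], [0.1, 1.2]]
# Hypothetical transition scores rewarding runs of equal states.
transition = [[0.5, 0.0], [0.0, 0.5]]
initial = [0.0, 0.0]

independent = [max((0, 1), key=lambda s: e[s]) for e in emission]
collective = viterbi(emission, transition, initial)
print(independent, collective)
```

With these toy numbers the middle residue is labeled free when each position is decided independently, but metal bonded when the whole subsequence is decoded jointly.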

Binding geometry identification 

The core and novel feature in MetalDetector2 takes as 
input a protein chain and a (predicted) bonding state as- 
signment and predicts binding geometry. This task is 
formalized as a link prediction in a bipartite graph, 
where a ligand node is connected to an ion node if and 
only if the residue coordinates that ion. In order to solve 
the structured-output learning problem, we introduce a 
function F(x,y) measuring the 'compatibility' between 
the input information x (sequence and bonding state as- 
signment) and every admissible binding geometry y. The 
function is a linear combination of features of both x 
and y. The difficulty in this learning task is the inference 
step where F must be maximized with respect to y (in 
general, this is a hard combinatorial optimization 
problem). It turns out that under relatively mild assump- 
tions, namely that every CYS or HIS coordinates at most 
one metal ion, there exists an optimal greedy algorithm 
that can identify very efficiently the binding configuration 
y that maximizes F — see Ref. (19) for details. Features of 
x and y required to construct F are defined by means of a 
kernel function that defines the similarity between two 
chains. The kernel takes into account several sources of 
information, including the coordination pattern of each 
(predicted) site and multiple alignment profiles. 
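
The role of the one-ion-per-residue assumption can be seen in a deliberately simplified setting. The sketch below assumes, purely for illustration, that the compatibility score is a sum of independent scores for each (residue, ion) link; under that assumption, scanning the links in decreasing order of score and keeping a link only if its residue is still unassigned yields the highest-scoring admissible configuration. The function name and the scores are hypothetical, and this is not the algorithm or the feature representation of Ref. (19).

```python
def greedy_assign(edge_scores):
    """Greedily build a residue -> ion assignment.

    edge_scores maps (residue position, ion identifier) to a score.
    Each residue is linked to at most one ion; links with a
    non-positive score are never added.
    """
    assignment = {}
    ranked = sorted(edge_scores.items(), key=lambda item: item[1], reverse=True)
    for (residue, ion), score in ranked:
        if score <= 0:
            break  # remaining links can only lower the total score
        if residue not in assignment:
            assignment[residue] = ion
    return assignment


# Hypothetical link scores for four candidate residues and two ions.
edge_scores = {
    (16, 1): 2.0, (16, 2): 0.5,
    (19, 1): 1.5, (19, 2): 0.2,
    (33, 1): 0.4, (33, 2): 1.0,
    (35, 1): -0.3, (35, 2): -0.1,
}
geometry = greedy_assign(edge_scores)
print(geometry)
```

Residue 35 has no link with a positive score and is therefore left out of every site.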

THE WEB SERVER INTERFACE 

Input 

The input sequence can be entered either as a plain
amino acid string or in FASTA format. The web interface
allows the user to choose between three different settings,
corresponding to the three different paths in Figure 2:
(i) no prior knowledge (default operation mode); (ii) the
chain is known to belong to a metalloprotein; (iii) the
chain is known to belong to a metalloprotein, and 
the user can also provide (a guess for) the bonding state 
of each CYS and HIS. Note that checking in the web 
interface that a chain is known to bind metal is a form 
of positive evidence (i.e. not checking it means ignorance, 
not negative evidence). This knowledge can be obtained, 
for example, if the protein was annotated as a 
metalloprotein via HT-XAS (7,8). 
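
For readers preparing inputs programmatically, the sketch below shows one way to normalize either accepted input form into a single residue string. How the server itself parses its input is not described here, so the function name and the handling of whitespace and letter case are assumptions.

```python
def read_sequence(text):
    """Return the residue string from a plain or FASTA-formatted input."""
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        residues.append("".join(line.split()))
    return "".join(residues).upper()


fasta_input = ">example\nNKHKFRLHSYSS\nPTFCDHCGSLLY\n"
plain_input = "nkhkfrlhsyssptfcdhcgslly"
print(read_sequence(fasta_input))
print(read_sequence(plain_input))
```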

Output 

Output is either presented on a separate web page or
delivered via e-mail. It consists of a table having an entry
for each CYS and HIS, with the indication of its position 
within the sequence, its predicted bonding state and, if the 
residue was predicted as metal bonded, the assigned metal 
ion identifier. Residues predicted to coordinate the same 
ion will share the same identifier. Every identifier is an 
integer ranging from 1 to 4 (maximum number of 
binding sites that can be predicted). Its value has no 
special biochemical semantics but lower values correspond
to a higher level of confidence for the predictor,
as the greedy algorithm first builds sites where it is more 
confident. Figure 3 shows a web browser output for PDB
entry 1t3qA.



RESULTS AND DISCUSSION 

We evaluated performance according to several measures: 

• precision (P_B) and recall (R_B) of residue bonding state:
precision is the ratio of true positives to the total
number of residues predicted in metal-bonding state;
recall (or sensitivity) is the ratio of true positives to the
total number of metal-binding residues;

• precision (P_E) and recall (R_E) of ligand prediction, i.e.
assignment of a residue to a metal ion. As we are not
trying to predict the chemical element of each ion but to
correctly group together ligands of the same ion,
equivalence classes due to arbitrary reordering of ion
identifiers are taken into account. In Figure 1, for
instance, the correct labeling is {(3,33,36,52),
(16,19,41,44)}. A prediction like {(16,19,41,52),
(33,35,36)} would contain five correct assignments out
of seven, while the true overall number of ligands is
eight, giving P_E = 5/7 and R_E = 5/8. Note that the
measure also accounts for ligands predicted as
non-metal-binding, like 3 or 44, and non-ligands
predicted as metal binding, like 35. The former negatively
affect recall, the latter precision.

• true-positive hit rate (H_T) and false-positive hit rate
(H_F), where a hit is counted whenever the intersection
between a predicted and a true site is non-empty: H_T
is, therefore, the fraction of true sites having at least one
correctly identified ligand, and H_F is the fraction of
predicted sites having no correctly identified residues.
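
The ligand-level and site-level measures can be sketched as follows. This is one reading of the definitions above rather than the evaluation code used for Table 1: the arbitrary ordering of ion identifiers is handled by trying every pairing of predicted and true sites and keeping the one with the most correct assignments, which is feasible because at most four sites are predicted. Function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def ligand_precision_recall(true_sites, pred_sites):
    """Return (P_E, R_E) under the best pairing of predicted and true sites."""
    true_sets = [set(s) for s in true_sites]
    pred_sets = [set(s) for s in pred_sites]
    n_true = sum(len(t) for t in true_sets)
    n_pred = sum(len(p) for p in pred_sets)
    # Pad with empty sites so both lists have the same length.
    size = max(len(true_sets), len(pred_sets))
    true_sets += [set()] * (size - len(true_sets))
    pred_sets += [set()] * (size - len(pred_sets))
    correct = max(
        sum(len(t & p) for t, p in zip(true_sets, perm))
        for perm in permutations(pred_sets)
    )
    precision = Fraction(correct, n_pred) if n_pred else Fraction(0)
    recall = Fraction(correct, n_true) if n_true else Fraction(0)
    return precision, recall


def hit_rates(true_sites, pred_sites):
    """Return (H_T, H_F) based on non-empty site intersections."""
    true_sets = [set(s) for s in true_sites]
    pred_sets = [set(s) for s in pred_sites]
    hits = sum(1 for t in true_sets if any(t & p for p in pred_sets))
    misses = sum(1 for p in pred_sets if not any(p & t for t in true_sets))
    h_t = Fraction(hits, len(true_sets)) if true_sets else Fraction(0)
    h_f = Fraction(misses, len(pred_sets)) if pred_sets else Fraction(0)
    return h_t, h_f


# Example from the text (Figure 1).
true_sites = [(3, 33, 36, 52), (16, 19, 41, 44)]
pred_sites = [(16, 19, 41, 52), (33, 35, 36)]

p_e, r_e = ligand_precision_recall(true_sites, pred_sites)
h_t, h_f = hit_rates(true_sites, pred_sites)
print(p_e, r_e, h_t, h_f)  # 5/7 5/8 1 0
```

On the example from the text this reproduces P_E = 5/7 and R_E = 5/8; both true sites are hit and no predicted site is entirely wrong.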

Figure 3. Output of the predictor for PDB entry 1t3qA.

Table 1. Evaluation of MetalDetector2

Data set             Size   P_B       R_B       P_E       R_E       H_T       H_F
UniqueProt            199   79 ± 4    88 ± 4    68 ± 4    74 ± 4    93 ± 4    10 ± 3
SCOP-folds           1824   62 ± 5    71 ± 10   61 ± 6    57 ± 7    70 ± 9    19 ± 4
SCOP-superfamilies   1466   60 ± 4    74 ± 10   56 ± 6    60 ± 10   74 ± 10   22 ± 5
PDB 2010              549   60        75        50        62        77        20

The server was tested on three distinct data sets, according
to the different criteria for redundancy elimination.
(All the data sets are available online on the server
website and in the Supplementary Data.) The first data set was
obtained starting from the one in Ref. (11), where
redundancy between sequences was removed using
UniqueProt (21). The 199 metal-binding chains were
collected from that data set, after removing sites contain- 
ing residues different from CYS/HIS, or with a coordin- 
ation number greater than four. Results in the first row of 
Table 1 are averages of 30 different train/test random 
splits, always in a ratio of 80/20. When starting from 
known bonding state, the predictor achieves on this data 
P_E = R_E = 90 ± 3. We finally measured accuracy in the
metalloprotein prediction task (i.e. classifying the whole 
sequence as metalloprotein or not), on the whole data set 
in Ref. (11): MetalDetector v2.0 correctly predicted as 
metalloproteins 65% of the ones in this data set, and as 
non-metalloproteins 96% of the 2362 chains having no 
metal-bonded CYS/HIS. 

The second data set was built according to the 
Structural Classification of Proteins (SCOP) hierarchy 
(22): the goal here was to test the predictor on new (i.e. 
not seen during the training phase) SCOP folds/ 
superfamilies. We started from the December 2009 
release of PDB, extracting 17 783 protein chains with at 
least a CYS or HIS bonded to a metal ion, and we retained 
only those chains which were mapped in SCOP 1.75 
release (June 2009). After removing very few cases of 
chains bonded to more than five ions, we finally
obtained a sequence-unique data set of 1 824 protein
chains by running CD-HIT v4.0 (23) with sequence 
identity threshold set to 0.9 (default value). 

Using this second data set, we partitioned the chains in 
10 different subsets, maintaining the same average per- 
centage of ligands in each subset, and allowing no pair 
of chains in different subsets to belong to the same 
SCOP superfamily. In a second version of this data set, 
we considered SCOP folds instead of superfamilies, and 
we therefore had to discard multi-domain chains, as
building the partition would have been otherwise unfeas- 
ible: this version of the data set was therefore reduced to 
1466 chains. We trained 10 different models, using 9 of the 
subsets as the training set and the remaining subset as 
the test set. Results are summarized in the second and 
the third row in Table 1. Performance measures are 
averaged on the 10 splits. 
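
The constraint that no SCOP superfamily (or fold) is shared between subsets can be sketched as a group-wise partition. The procedure below is an assumption for illustration only, since the exact partitioning method is not detailed here: it assigns whole groups, largest first, to the subset that currently holds the fewest ligands, and it assumes each chain carries a single group label.

```python
from collections import defaultdict


def partition_by_group(chains, n_subsets=10):
    """Split chains into subsets without splitting any group.

    chains is a list of (chain id, group id, number of ligands) tuples,
    where the group id stands for a SCOP superfamily or fold.
    """
    groups = defaultdict(list)
    for chain_id, group_id, n_ligands in chains:
        groups[group_id].append((chain_id, n_ligands))

    def group_load(item):
        return sum(n for _, n in item[1])

    subsets = [[] for _ in range(n_subsets)]
    loads = [0] * n_subsets
    # Largest groups first; ties are broken by group id for reproducibility.
    ordered = sorted(groups.items(), key=lambda item: (-group_load(item), item[0]))
    for group_id, members in ordered:
        target = loads.index(min(loads))  # least loaded subset so far
        subsets[target].extend(chain_id for chain_id, _ in members)
        loads[target] += sum(n for _, n in members)
    return subsets


# Hypothetical chains: (chain id, superfamily id, number of ligands).
toy_chains = [
    ("a", "sf1", 4), ("b", "sf1", 3),
    ("c", "sf2", 4),
    ("d", "sf3", 2), ("e", "sf3", 1),
]
toy_subsets = partition_by_group(toy_chains, n_subsets=2)
print(toy_subsets)
```

Each subset can then serve in turn as the test set while the remaining ones are used for training.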

The predictor available on the web server was trained 
on the whole SCOP-based data set. As a final test, we 
extracted 549 metal-bonded chains from PDB entries de- 
posited in 2010 (after removing duplicates). Performance 
of the web server on this data set is reported in the fourth 
row of Table 1. Results in this setting are comparable to
those obtained on the SCOP-based data sets. 

In the Supplementary Data, we show the breakdown of 
prediction performance according to the number of 
coordinating ligands per ion. These results indicate that 
in the majority of cases MetalDetector2 is capable of 
identifying most of the binding site: in the PDB 2010 data set,
for example, among the 268 sites having 2 coordinating 
residues, MetalDetector2 correctly identifies both residues 
in 41.6% of the cases and one of the two in 42.0% of the
cases. In 65 and 62% of the cases, the server misses at
most one ligand in the sites with three and four 
coordinating residues, respectively. Concerning precision, 
at least half of the returned candidates actually belong to 
the site on average. 



CONCLUSION 

This release of MetalDetector adds an important feature 
to metalloprotein prediction, namely the ability to
identify the number of binding sites and the involved 
CYS and HIS ligands. Unlike existing servers that can 
perform this task, MetalDetector does not rely on 3D 
structure similarity and can predict binding sites of 
proteins in novel folds. 